
    Analyzing the Performance of Lock-Free Data Structures: A Conflict-based Model

    This paper considers the modeling and the analysis of the performance of lock-free concurrent data structures. Lock-free designs employ an optimistic conflict control mechanism, allowing several processes to access the shared data object at the same time. They guarantee that at least one concurrent operation finishes in a finite number of its own steps regardless of the state of the other operations. Our analysis considers lock-free data structures that can be represented as linear combinations of fixed size retry loops. Our main contribution is a new way of modeling and analyzing a general class of lock-free algorithms, achieving predictions of throughput that are close to what we observe in practice. We emphasize two kinds of conflicts that shape the performance: (i) hardware conflicts, due to concurrent calls to atomic primitives; (ii) logical conflicts, caused by simultaneous operations on the shared data structure. We show how to deal with these hardware and logical conflicts separately, and how to combine them, so as to calculate the throughput of lock-free algorithms. We also propose a common framework that enables a fair comparison between lock-free implementations by covering the whole contention domain, together with a better understanding of the performance-impacting factors. This part of our analysis comes with a method for calculating a good back-off strategy to finely tune the performance of a lock-free algorithm. Our experimental results, based on a set of widely used concurrent data structures and on abstract lock-free designs, show that our analysis follows closely the actual code behavior. Comment: Short version to appear in DISC'1
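    A minimal sketch (in C11, with assumed names, not the paper's code) of the fixed-size retry-loop pattern the analysis targets: the loop reads the shared access point, computes locally, and attempts a single compare-and-swap. A failed CAS is a logical conflict (another operation changed the shared word in the meantime), while concurrent atomic calls on the same cache line are the hardware conflicts; the back-off parameter is the kind of knob the proposed back-off calculation would tune.

        #include <stdatomic.h>
        #include <stdint.h>

        /* Stands in for the single shared access point of a lock-free object. */
        static _Atomic uint64_t shared_word;

        uint64_t retry_loop_op(unsigned backoff_iters)
        {
            for (;;) {
                uint64_t old  = atomic_load(&shared_word);   /* read shared state             */
                uint64_t next = old + 1;                     /* compute the new value locally */
                if (atomic_compare_exchange_weak(&shared_word, &old, next))
                    return old;                              /* success: the operation is done */
                /* CAS failed: a logical conflict; hardware conflicts show up as the
                 * extra latency of the atomic calls themselves under contention.    */
                for (volatile unsigned i = 0; i < backoff_iters; i++)
                    ;                                        /* back off before retrying      */
            }
        }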

    Throughput and energy efficiency of lock-free data structures: Execution Models and Analyses

    Concurrent data structures are key program components for harnessing the available parallelism in multi-core processors. Lock-free algorithmic implementations of concurrent data structures offer high scalability and possess desirable properties such as immunity to deadlocks, convoying and priority inversion. In this thesis, we develop analytical tools to model and analyze the throughput and energy consumption of concurrent lock-free data structures. We start our study with a general class of lock-free data structures. Then, we target more specialized designs for lock-free queues. Finally, we focus on search data structures, which possess different characteristics compared to the previously mentioned data structures.

    Performance of lock-free data structures: This thesis contributes to the problem of making ends meet between theoretical bounds and actual measured throughput. As the first step, we consider a general class of lock-free data structures and propose three analytical frameworks with different flavors. Analyses of this class also cover efficient implementations of a set of fundamental data structures that suffer from inherent sequential bottlenecks. We model the executions and examine the impact of contention on the throughput of these algorithms. Our analyses lead to optimization methods on memory management and back-off strategies.

    Performance and energy efficiency of lock-free queues: We take a step further to model the throughput of lock-free operations and their interaction. Considering shared queues, a key paradigm for data sharing, operations (Enqueue, Dequeue) access the opposite ends of a queue. Operations of the same type might contend with each other on a non-empty queue; however, all types of operations are subject to interaction when the queue is empty. We first decorrelate the throughput of dequeuers and enqueuers into several uncorrelated basic throughputs, and reconstruct the main throughputs as a function of these basic throughputs. In addition, we model the power dissipation and integrate it with the throughput estimations to extract the energy consumption of applications that utilize lock-free queues.

    Performance of lock-free search data structures: Lock-free designs that utilize fine-grained synchronization have produced efficient implementations of search data structures. These designs reveal different characteristics compared to the previous set of lock-free data structures with inherent sequential bottlenecks. We introduce a new way of modeling and analyzing the throughput of search data structures under stationary and memoryless access patterns.
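    Assuming only the textbook relation energy = power × time (the thesis's actual power model is not reproduced here), the way the two model outputs combine can be written down directly: if the model predicts an average power dissipation P (watts) and a throughput T (operations per second) for a lock-free queue workload, then the energy per operation is roughly

        E_op ≈ P / T   (joules per operation),

    so any change that raises throughput without raising power proportionally lowers the energy cost of each Enqueue or Dequeue.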

    Performance Analysis and Modelling of Concurrent Multi-access Data Structures

    The major impediment to scaling concurrent data structures is memory contention when accessing shared data structure access-points, which leads to thread serialisation and hinders parallelism. Aiming to address this challenge, a significant amount of work in the literature has proposed multi-access techniques that improve concurrent data structure parallelism. However, there is little work on analysing and modelling the execution behaviour of concurrent multi-access data structures, especially in a shared memory setting. In this paper, we analyse and model the general execution behaviour of concurrent multi-access data structures in the shared memory setting. We study and analyse the behaviour of the two popular random access patterns: shared (Remote) and exclusive (Local) access, and the behaviour of the two most commonly used atomic primitives for designing lock-free data structures: Compare and Swap, and Fetch and Add. We model the concurrent multi-accesses by splitting the thread execution procedure into five logical sessions: i) side-work, ii) access-point search, iii) access-point acquisition, iv) access-point data acquisition and v) access-point data operation. We model the acquisition of an access-point as a system of closed queuing networks with parallel servers, and data acquisition in terms of where the data is located within the memory system. We evaluate our model on a set of concurrent data structure designs including a counter, a stack and a FIFO queue. The evaluation is carried out on two state-of-the-art multi-core processors: the Intel Xeon Phi CPU 7290 with 72 physical cores and the Intel Xeon E5-2695 with 14 physical cores. Our model is able to predict the throughput performance of the given concurrent data structures with 80% to 100% accuracy on both architectures.
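    As a concrete illustration of the two atomic primitives the model distinguishes (a hedged sketch, not the paper's code), the functions below increment a shared counter access point. Fetch-and-Add completes in a single hardware transaction, whereas Compare-and-Swap may have to retry under contention; that retrying is precisely the access-point acquisition behaviour a closed queuing network with parallel servers has to capture.

        #include <stdatomic.h>
        #include <stdint.h>

        static _Atomic uint64_t shared_ctr;   /* the shared access point */

        uint64_t inc_with_faa(void)
        {
            /* One atomic instruction; it never fails, regardless of contention. */
            return atomic_fetch_add(&shared_ctr, 1);
        }

        uint64_t inc_with_cas(void)
        {
            uint64_t old = atomic_load(&shared_ctr);
            /* On failure the CAS refreshes `old` with the current value, so the
             * loop keeps retrying until its increment is applied.               */
            while (!atomic_compare_exchange_weak(&shared_ctr, &old, old + 1))
                ;
            return old;
        }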

    Monotonically relaxing concurrent data-structure semantics for performance: An efficient 2D design framework

    There has been a significant amount of work in the literature proposing semantic relaxation of concurrent data structures for improving scalability and performance. By relaxing the semantics of a data structure, a bigger design space that allows weaker synchronization and more useful parallelism is unveiled. Investigating new data structure designs capable of trading semantics for better performance in a monotonic way is a major challenge in the area. We algorithmically address this challenge in this paper. We present an efficient, lock-free, concurrent data structure design framework for out-of-order semantic relaxation. Our framework introduces a new two-dimensional algorithmic design that uses multiple instances of a given data structure. The first dimension of our design is the number of data structure instances that operations are spread to, in order to benefit from parallelism through disjoint memory access. The second dimension is the number of consecutive operations that try to use the same data structure instance, in order to benefit from data locality. Our design can flexibly explore this two-dimensional space to achieve the property of monotonically relaxing concurrent data structure semantics for better throughput performance within a tight deterministic relaxation bound, as we prove in the paper. We show how our framework can instantiate lock-free out-of-order queues, stacks, counters and deques. We provide implementations of these relaxed data structures and evaluate their performance and behaviour on two parallel architectures. Experimental evaluation shows that our two-dimensional data structures significantly outperform the respective previously proposed ones with respect to scalability and throughput performance. Moreover, their throughput increases monotonically as relaxation increases.
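    The sketch below (assumed names, and a counter rather than a queue or stack, so it is not the authors' code) illustrates the two dimensions: WIDTH instances give disjoint-access parallelism, and each thread stays on the same instance for DEPTH consecutive operations to benefit from locality before moving on. The framework's shared-window machinery, which is what actually enforces the tight deterministic relaxation bound, is omitted here.

        #include <stdatomic.h>
        #include <stdint.h>

        #define WIDTH 8   /* dimension 1: number of data structure instances     */
        #define DEPTH 4   /* dimension 2: consecutive operations on one instance */

        static _Atomic uint64_t sub_counter[WIDTH];

        static _Thread_local unsigned my_idx;   /* instance this thread currently uses */
        static _Thread_local unsigned my_ops;   /* operations issued on that instance  */

        void relaxed_increment(void)
        {
            if (my_ops++ % DEPTH == 0)                    /* time to hop to another instance */
                my_idx = (my_idx + 1) % WIDTH;            /* (round-robin for simplicity)    */
            atomic_fetch_add(&sub_counter[my_idx], 1);
        }

        uint64_t relaxed_read(void)                       /* may miss in-flight increments */
        {
            uint64_t sum = 0;
            for (unsigned i = 0; i < WIDTH; i++)
                sum += atomic_load(&sub_counter[i]);
            return sum;
        }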

    Brief announcement: 2D-stack - A scalable lock-free stack design that continuously relaxes semantics for better performance

    We briefly describe an efficient lock-free concurrent stack design with tunable and tenable relaxed semantics to allow for better performance. The design is tunable and allows for a continuous, monotonic trade of weaker semantics for better throughput performance. Concurrent stacks have an inherent scalability bottleneck due to their single access point for both of their operations. Elimination and semantics relaxation have been proposed in the literature to address this problem. Semantics relaxation has the potential to reach monotonically very high throughput by continuously trading relaxation for throughput; previous solutions could not fully leverage this potential. We suggest a new two-dimensional design that can achieve this by exploiting disjoint access parallelism in one dimension and locality in the other, within tight accuracy bounds. The behaviour of the algorithm is tightly bound. We compare experimentally to previous work, with respect to throughput and relaxed behaviour observed, on different relaxation and concurrency settings. The experimental evaluation shows that our algorithm significantly outperforms all other algorithms in terms of performance, while maintaining better accuracy than other designs with relaxed semantics.

    2D-Stack: A scalable lock-free stack design that continuously relaxes semantics for better performance

    In this report, we propose an efficient lock-free concurrent stack design with tunable and tenable relaxed semantics to allow for better performance. The design is materialized by a shared memory distributed stack that allows for a continuous, monotonic trade of weaker semantics for better throughput performance. Concurrent stacks have an inherent scalability bottleneck due to their single access point for both push and pop operations. Elimination and semantics relaxation have been proposed in the literature to address this problem. Semantic relaxation has the potential and flexibility to reach monotonically very high throughput; previous solutions could not fully leverage this potential. We propose a new two-dimensional design that can achieve this by exploiting disjoint access parallelism in one dimension and locality in the other. This is achieved by distributing the stack in the form of sub-stacks that are accessed independently in parallel. Load balancing is used to keep a balanced number of operations on the individual sub-stacks. We also provide tight relaxation bounds for the behaviour of our algorithm. We compare experimentally to previous work, with respect to throughput and relaxed behaviour observed, on different relaxation and concurrency settings. The results show that our algorithm significantly outperforms all other algorithms in terms of performance, while maintaining better quality in contrast to other designs with relaxed semantics.
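    A structural sketch of the distributed-stack idea (assumed types and names; ABA handling, memory reclamation, the pop operation and the report's exact load-balancing and relaxation-bound machinery are all omitted): the stack is split into an array of Treiber-style sub-stacks, each with an approximate size counter, and a push simply prefers the emptier of two randomly probed sub-stacks, a stand-in heuristic for the report's balancing scheme.

        #include <stdatomic.h>
        #include <stdlib.h>

        #define SUBSTACKS 8

        struct node { int val; struct node *next; };

        struct substack {
            _Atomic(struct node *) top;
            _Atomic long           size;   /* approximate, used only for balancing */
        };

        static struct substack stacks[SUBSTACKS];

        static void sub_push(struct substack *s, int val)
        {
            struct node *n = malloc(sizeof *n);
            n->val  = val;
            n->next = atomic_load(&s->top);
            /* Classic Treiber push: on failure, n->next is refreshed and we retry. */
            while (!atomic_compare_exchange_weak(&s->top, &n->next, n))
                ;
            atomic_fetch_add(&s->size, 1);
        }

        void push(int val)
        {
            /* Probe two sub-stacks and push to the one that currently looks emptier;
             * rand() is used only to keep the sketch short.                          */
            unsigned a = (unsigned)rand() % SUBSTACKS;
            unsigned b = (unsigned)rand() % SUBSTACKS;
            struct substack *s = atomic_load(&stacks[a].size) <= atomic_load(&stacks[b].size)
                                     ? &stacks[a] : &stacks[b];
            sub_push(s, val);
        }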

    Modeling and Analyzing the Performance of Lock-Free Data Structures

    This paper considers the modeling and the analysis of the performance of lock-free concurrent data structures. Lock-free designs employ an optimistic conflict control approach, allowing several processes to access the shared data object at the same time. The operations on these data structures are typically designed as compositions of retry loops. Our main contribution is a new way of modeling and analyzing a general class of lock-free algorithms, achieving predictions of throughput that are close to what we observe in practice. In our model we introduce two key metrics that shape the performance of lock-free algorithms: (i) the expansion in execution time of a retry due to memory congestion and (ii) the number of wasted retries. We show how to compute these two metrics, and how to combine them, to calculate the throughput of an arguably large class of lock-free algorithms. Our analysis also captures the throughput performance of a lock-free algorithm when executed as part of a larger parallel application. This part of our analysis leads to an analytical method for calculating a good back-off strategy to finely tune the performance of a lock-free application. Our experimental results, based on a set of widely used concurrent data structures and on abstract lock-free designs, show that our analysis follows closely the actual code behavior. To the best of our knowledge, this is the first attempt to make ends meet between theoretical bounds on performance and actual measured throughput.
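    One schematic way to read how the two metrics combine (an illustration consistent with the abstract, not the paper's exact formula): with P threads, a parallel section of pw cycles per operation, a retry body of rc cycles expanded by e cycles of memory congestion, and w wasted retries per successful operation, a first-order throughput estimate is

        T ≈ P / (pw + (1 + w) · (rc + e))   successful operations per cycle,

    so both a larger expansion e and more wasted retries w depress throughput; a back-off strategy can pay off by slightly lengthening the parallel section in exchange for fewer wasted retries and less expansion.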

    White-box methodologies, programming abstractions and libraries

    EXCESS deliverable D2.2. More information at http://www.excess-project.eu/. This deliverable reports the results of white-box methodologies and early results of the first prototype of libraries and programming abstractions as available by project month 18 by Work Package 2 (WP2). It reports i) the latest results of Task 2.2 on white-box methodologies, programming abstractions and libraries for developing energy-efficient data structures and algorithms and ii) the improved results of Task 2.1 on investigating and modeling the trade-off between energy and performance of concurrent data structures and algorithms. The work has been conducted on two main EXCESS platforms: Intel platforms with recent Intel multicore CPUs and the Movidius Myriad1 platform.